20th March, 2024
EDA created by Jahanavi Desai
UBC - Data Visualization in Python
This notebook presents an exploratory data analysis of a subset of the Vancouver Street Trees dataset. The data were obtained from the City of Vancouver's Open Data Portal and are released under the Open Government License – Vancouver.

Vancouver is known for its mountains, trees, ocean, lakes, and scenic views. The greenery in my neighbourhood makes me fall in love with nature. It makes me wonder: as the city's economy grows, how has tree planting changed over the years? Does growth affect the number of trees around us? Looking at the statistical data on the City of Vancouver website, it is fascinating to see how tree planting differs across neighbourhoods. Are some species planted more than others? What is the relationship between diameter and height? We will address these questions using an interactive dashboard.
In this analysis, I investigate several questions about the Vancouver Street Trees dataset. I am interested in finding out: 1) Which species has the largest diameter and height range in the Kerrisdale neighbourhood? I live in this neighbourhood and am curious which species I see every day. 2) How useful is a map linked to an interactive scatter plot of diameter and height for each neighbourhood? 3) How many trees are planted in each neighbourhood, by planting area? 4) How many trees are on the even side and the odd side of the street? 5) Can all of these be combined into a single interactive dashboard with full interactivity?
The descriptions below were taken directly from the City of Vancouver's Open Data Portal, where the dataset was obtained; the data are released under the Open Government License – Vancouver.
The dataset contains street tree data from Vancouver, including information such as tree species, diameter, location, and planting date. Let's begin by understanding the structure of the dataset and exploring the columns of interest.
The analysis uses a subset of the full dataset.
Our data schema:
| Column | Description |
|---|---|
| tree_id | Tree's unique ID |
| civic_number | Street address at which the tree is located |
| std_street | Street name at which the tree is located |
| genus_name | Genus name of the tree |
| species_name | Species name of the tree |
| cultivar_name | Cultivar name of the tree |
| common_name | Common name of the tree |
| assigned | Indicates whether the address is made up to associate the tree with a nearby lot (Y=Yes or N=No) |
| root_barrier | Root barrier installed (Y = Yes, N = No) |
| plant_area | B = behind sidewalk, C = cutout, G = in tree grate, L = lane, N = no sidewalk, P = park. Numeric value indicates boulevard width in feet |
| on_street_block | The street block at which the tree is physically located on |
| on_street | The name of the street at which the tree is physically located on |
| neighbourhood_name | City's defined local area in which the tree is located. |
| street_side_name | The street side which the tree is physically located on (Even, Odd or Median (Med)) |
| height_range_id | 0-10 for every 10 feet (e.g., 0 = 0-10 ft, 1 = 10-20 ft, 2 = 20-30 ft, and 10 = 100+ ft) |
| height_range | Height range of the tree measured in feet |
| diameter | DBH in inches (DBH stands for diameter of tree at breast height) |
| curb | Curb presence (Y = Yes, N = No) |
| date_planted | Date the tree was planted, in YYYY-MM-DD format |
| latitude | Latitude of the tree's location |
| longitude | Longitude of the tree's location |
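The `height_range_id` encoding described in the table can be turned back into a readable range. The helper below is a minimal sketch under that description only; the function name `height_range_label` is hypothetical and is not part of the dataset or of any library.

```python
def height_range_label(height_range_id: int) -> str:
    """Translate a height_range_id (0-10) into a readable range in feet."""
    if not 0 <= height_range_id <= 10:
        raise ValueError("height_range_id must be between 0 and 10")
    if height_range_id == 10:
        # The schema describes the last bucket as open-ended.
        return "100+ ft"
    lower = height_range_id * 10
    return f"{lower}-{lower + 10} ft"


print(height_range_label(4))
```

A label like this could be used in tooltips so that readers do not have to decode the ID themselves.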
Data Wrangling and learning about the Data
# Import the libraries needed for this analysis
import os
import numpy as np
import altair as alt
import pandas as pd
#alt.data_transformers.enable("data_server")
Let's see what the table looks like.
trees_df = pd.read_csv(
"https://raw.githubusercontent.com/UBC-MDS/data_viz_wrangled/main/data/Trees_data_sets/small_unique_vancouver.csv",
parse_dates=["date_planted"],
)
trees_df.head()
| | Unnamed: 0 | std_street | on_street | species_name | neighbourhood_name | date_planted | diameter | street_side_name | genus_name | assigned | ... | plant_area | curb | tree_id | common_name | height_range_id | on_street_block | cultivar_name | root_barrier | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10747 | W 20TH AV | W 20TH AV | PLATANOIDES | Riley Park | 2000-02-23 | 28.5 | EVEN | ACER | N | ... | 15 | Y | 21421 | NORWAY MAPLE | 4 | 0 | NaN | N | 49.252711 | -123.106323 |
| 1 | 12573 | W 18TH AV | W 18TH AV | CALLERYANA | Arbutus-Ridge | 1992-02-04 | 6.0 | ODD | PYRUS | N | ... | 7 | Y | 129645 | CHANTICLEER PEAR | 2 | 2300 | CHANTICLEER | N | 49.256350 | -123.158709 |
| 2 | 29676 | ROSS ST | ROSS ST | NIGRA | Sunset | NaT | 12.0 | ODD | PINUS | N | ... | 7 | Y | 154675 | AUSTRIAN PINE | 4 | 7800 | NaN | N | 49.213486 | -123.083254 |
| 3 | 8856 | DOMAN ST | DOMAN ST | AMERICANA | Killarney | 1999-11-12 | 11.0 | EVEN | FRAXINUS | N | ... | 7 | Y | 180803 | AUTUMN APPLAUSE ASH | 4 | 6900 | AUTUMN APPLAUSE | N | 49.220839 | -123.036721 |
| 4 | 21098 | EAST BOULEVARD | EAST BOULEVARD | HIPPOCASTANUM | Shaughnessy | NaT | 15.5 | ODD | AESCULUS | Y | ... | N | Y | 74364 | COMMON HORSECHESTNUT | 4 | 5200 | NaN | N | 49.238514 | -123.154958 |
5 rows × 21 columns
Let's get some more information about the trees dataframe.
trees_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 21 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Unnamed: 0          5000 non-null   int64
 1   std_street          5000 non-null   object
 2   on_street           5000 non-null   object
 3   species_name        5000 non-null   object
 4   neighbourhood_name  5000 non-null   object
 5   date_planted        2363 non-null   datetime64[ns]
 6   diameter            5000 non-null   float64
 7   street_side_name    5000 non-null   object
 8   genus_name          5000 non-null   object
 9   assigned            5000 non-null   object
 10  civic_number        5000 non-null   int64
 11  plant_area          4950 non-null   object
 12  curb                5000 non-null   object
 13  tree_id             5000 non-null   int64
 14  common_name         5000 non-null   object
 15  height_range_id     5000 non-null   int64
 16  on_street_block     5000 non-null   int64
 17  cultivar_name       2658 non-null   object
 18  root_barrier        5000 non-null   object
 19  latitude            5000 non-null   float64
 20  longitude           5000 non-null   float64
dtypes: datetime64[ns](1), float64(3), int64(5), object(12)
memory usage: 820.4+ KB
The trees table has $5000$ rows and $21$ columns.
Three columns contain null values: 'date_planted', 'plant_area' and 'cultivar_name'.
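The null counts read off the `info()` output can also be computed directly. The sketch below uses a small made-up frame (`toy_df`, with illustrative values only) in place of `trees_df`; the same two lines should apply to the real dataframe.

```python
import pandas as pd

# Toy frame standing in for trees_df; the values are illustrative only.
toy_df = pd.DataFrame(
    {
        "date_planted": pd.to_datetime(["2000-02-23", None, "1999-11-12"]),
        "plant_area": ["15", None, "N"],
        "diameter": [28.5, 12.0, 11.0],
    }
)

# Count missing values per column, then keep only columns that have any.
null_counts = toy_df.isna().sum()
columns_with_nulls = null_counts[null_counts > 0]
print(columns_with_nulls)
```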
trees_df.describe()
| | Unnamed: 0 | diameter | civic_number | tree_id | height_range_id | on_street_block | latitude | longitude |
|---|---|---|---|---|---|---|---|---|
| count | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.00000 | 5000.000000 | 5000.000000 | 5000.000000 |
| mean | 14861.920400 | 12.340888 | 2975.707600 | 128682.584600 | 2.73440 | 2960.227000 | 49.247349 | -123.107128 |
| std | 8680.023278 | 9.266600 | 2078.580429 | 75412.260406 | 1.56957 | 2086.861052 | 0.021251 | 0.049137 |
| min | 2.000000 | 0.000000 | 2.000000 | 36.000000 | 0.00000 | 0.000000 | 49.202783 | -123.220560 |
| 25% | 7192.750000 | 4.000000 | 1300.500000 | 61321.500000 | 2.00000 | 1300.000000 | 49.230152 | -123.144178 |
| 50% | 14870.000000 | 10.000000 | 2639.000000 | 130130.500000 | 2.00000 | 2600.000000 | 49.247981 | -123.105861 |
| 75% | 22366.750000 | 18.000000 | 4123.000000 | 191332.000000 | 4.00000 | 4100.000000 | 49.263275 | -123.063484 |
| max | 29992.000000 | 71.000000 | 9113.000000 | 270750.000000 | 9.00000 | 9100.000000 | 49.293930 | -123.023311 |
Some columns are of no use for this analysis: cultivar_name is missing for almost half of the rows, and on_street appears to duplicate std_street in the rows previewed above. We drop both.
del trees_df["cultivar_name"]
del trees_df["on_street"]
Let's use the plant_area column to get the best insights. First we handle the missing values and the inconsistent letter case of the values.
# Normalise lowercase letter codes to uppercase, and map 'M' (not in the data dictionary) to 'Unknown'
trees_df['plant_area'] = trees_df['plant_area'].replace(
    {'c': 'C', 'b': 'B', 'g': 'G', 'M': 'Unknown'}
)
# Treat missing values as 'Unknown' as well
trees_df['plant_area'] = trees_df['plant_area'].fillna('Unknown')
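Because `plant_area` mixes letter codes with numeric boulevard widths, it can help to separate the two. The following is a minimal sketch on a made-up series; the label `"Boulevard"` and the names `width_ft` and `area_type` are assumptions introduced here for illustration, not values from the dataset.

```python
import pandas as pd

# Made-up sample of cleaned plant_area values.
plant_area = pd.Series(["15", "7", "N", "C", "Unknown", "B"])

# Numeric codes are boulevard widths in feet; letter codes become NaN here.
width_ft = pd.to_numeric(plant_area, errors="coerce")

# Keep letter codes as they are and group every numeric width under one label.
area_type = plant_area.where(width_ft.isna(), "Boulevard")
print(area_type.tolist())
```

Keeping the width as a separate numeric column would allow quantitative encodings, while `area_type` stays a small categorical field.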
We have to deal with the null values in the date_planted column. As we know, the date is in YYYY-MM-DD format. According to the source, the planted dates of new trees are added after every planting season, usually at the beginning of January and June.
# Convert the column to datetime; values that cannot be parsed become NaT
trees_df['date_planted'] = pd.to_datetime(trees_df['date_planted'], errors='coerce')
# Fill missing planting dates with 1 January or 1 June of the current year,
# following the source note that new trees are usually recorded in January and June.
# Note: filled rows carry the current year, not the tree's real planting year.
def missing_date_values(value):
    if pd.isnull(value['date_planted']):
        if pd.Timestamp.now().month < 6:
            return pd.Timestamp(year=pd.Timestamp.now().year, month=1, day=1)
        else:
            return pd.Timestamp(year=pd.Timestamp.now().year, month=6, day=1)
    else:
        return value['date_planted']
# Apply the function to every row (axis=1) and store the result back in the column
trees_df['date_planted'] = trees_df.apply(missing_date_values, axis=1)
trees_df['date_planted']
0 2000-02-23
1 1992-02-04
2 2024-01-01
3 1999-11-12
4 2024-01-01
...
4995 2024-01-01
4996 2014-01-14
4997 2002-04-15
4998 2003-12-02
4999 2024-01-01
Name: date_planted, Length: 5000, dtype: datetime64[ns]
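More than half of the planting dates are missing, so the filled rows all share the same placeholder date. One possible alternative, sketched here on made-up dates, is to record which rows were missing before filling and to keep a nullable year for year-based charts. The names `date_was_missing`, `year_known` and `trees_per_year` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Made-up planting dates; None stands for a missing value.
dates = pd.Series(pd.to_datetime(["2000-02-23", None, "1999-11-12", None]))

# Record which rows had no planting date before any filling happens.
date_was_missing = dates.isna()

# Planting year as a nullable integer, left empty for unknown dates.
year_known = dates.dt.year.astype("Int64")

# Trees per planting year, counting only rows with a known date.
trees_per_year = year_known.value_counts().sort_index()
print(trees_per_year)
```

With a flag like this, a chart can filter out placeholder dates instead of showing a spike in the current year.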
This dataframe has some columns that we do not need for our analysis. Rather than dropping them one by one, we make a child dataframe from trees_df with only the columns we need.
columns_to_use = [
"species_name",
"neighbourhood_name",
"diameter",
"date_planted",
"plant_area",
"height_range_id",
"street_side_name",
"latitude",
"longitude",
]
new_trees_df = trees_df[columns_to_use].copy()
new_trees_df
| | species_name | neighbourhood_name | diameter | date_planted | plant_area | height_range_id | street_side_name | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|
| 0 | PLATANOIDES | Riley Park | 28.5 | 2000-02-23 | 15 | 4 | EVEN | 49.252711 | -123.106323 |
| 1 | CALLERYANA | Arbutus-Ridge | 6.0 | 1992-02-04 | 7 | 2 | ODD | 49.256350 | -123.158709 |
| 2 | NIGRA | Sunset | 12.0 | 2024-01-01 | 7 | 4 | ODD | 49.213486 | -123.083254 |
| 3 | AMERICANA | Killarney | 11.0 | 1999-11-12 | 7 | 4 | EVEN | 49.220839 | -123.036721 |
| 4 | HIPPOCASTANUM | Shaughnessy | 15.5 | 2024-01-01 | N | 4 | ODD | 49.238514 | -123.154958 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | SERRULATA | Victoria-Fraserview | 17.0 | 2024-01-01 | 9 | 2 | EVEN | 49.221161 | -123.061023 |
| 4996 | XX | Kensington-Cedar Cottage | 3.0 | 2014-01-14 | 10 | 1 | EVEN | 49.241544 | -123.070644 |
| 4997 | TULIPIFERA | Killarney | 3.5 | 2002-04-15 | 7 | 2 | EVEN | 49.224511 | -123.048723 |
| 4998 | INVOLUCRATA | Mount Pleasant | 5.5 | 2003-12-02 | 5 | 1 | EVEN | 49.259208 | -123.096905 |
| 4999 | CAMPESTRE | Kensington-Cedar Cottage | 3.0 | 2024-01-01 | 8 | 1 | ODD | 49.243772 | -123.078967 |
5000 rows × 9 columns
new_trees_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   species_name        5000 non-null   object
 1   neighbourhood_name  5000 non-null   object
 2   diameter            5000 non-null   float64
 3   date_planted        5000 non-null   datetime64[ns]
 4   plant_area          5000 non-null   object
 5   height_range_id     5000 non-null   int64
 6   street_side_name    5000 non-null   object
 7   latitude            5000 non-null   float64
 8   longitude           5000 non-null   float64
dtypes: datetime64[ns](1), float64(3), int64(1), object(4)
memory usage: 351.7+ KB
Now we have a pretty clean dataframe to work on with no null values and with only the columns that are of use.
For our first visualization, let's get the answer for our first question:
new_trees_df['year'] = new_trees_df['date_planted'].dt.year
new_trees_df['year']
0 2000
1 1992
2 2024
3 1999
4 2024
...
4995 2024
4996 2014
4997 2002
4998 2003
4999 2024
Name: year, Length: 5000, dtype: int64
Let's make an interactive legend to explore the height and diameter of the trees in each neighbourhood:
legend_city = alt.selection_multi(fields=["neighbourhood_name"], bind="legend")
city_plot = (
alt.Chart(new_trees_df)
.mark_circle()
.encode(
x=alt.X("diameter:Q", title="Diameter of the Tree - inches"),
y=alt.Y("height_range_id:Q", title="Height Range"),
opacity=alt.condition(legend_city, alt.value(1), alt.value(0)),
color=alt.Color("neighbourhood_name:N", title="Neighbourhood"),
tooltip=[
"neighbourhood_name:N",
"species_name:N",
"diameter:Q",
"height_range_id:Q",
],
)
.add_selection(legend_city)
.properties(
width=550,
)
.interactive()
)
It's hard to see which species has the greatest height or diameter in each neighbourhood.
First, we get all the unique values of the neighbourhood column:
neighbourhoods = sorted(new_trees_df["neighbourhood_name"].unique())
neighbourhoods
['Arbutus-Ridge', 'Downtown', 'Dunbar-Southlands', 'Fairview', 'Grandview-Woodland', 'Hastings-Sunrise', 'Kensington-Cedar Cottage', 'Kerrisdale', 'Killarney', 'Kitsilano', 'Marpole', 'Mount Pleasant', 'Oakridge', 'Renfrew-Collingwood', 'Riley Park', 'Shaughnessy', 'South Cambie', 'Strathcona', 'Sunset', 'Victoria-Fraserview', 'West End', 'West Point Grey']
We will make a dropdown selection for all the neighbourhoods we have.
dropdown_neighbourhood = alt.binding_select(
options=neighbourhoods, name="Neighbourhood"
)
select_neighbourhood = alt.selection_single(
fields=["neighbourhood_name"],
bind=dropdown_neighbourhood,
name='neighbourhood name')
Here we build a dataframe containing only the five most common species across the whole dataset:
top_species_df = trees_df[
trees_df["species_name"].isin(
trees_df["species_name"].value_counts().nlargest(5).index
)
]
top_species = sorted(top_species_df["species_name"].unique())
top_species
['AMERICANA', 'CERASIFERA', 'PLATANOIDES', 'RUBRUM', 'SERRULATA']
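The five species above are the most common across the whole dataset, not within each neighbourhood. If a per-neighbourhood ranking were wanted, it might look like the sketch below, which runs on a made-up frame; `toy`, `top_n` and `top_per_neighbourhood` are hypothetical names.

```python
import pandas as pd

# Made-up rows; the real data would come from trees_df.
toy = pd.DataFrame(
    {
        "neighbourhood_name": ["Kerrisdale"] * 6 + ["Sunset"] * 3,
        "species_name": [
            "PLATANOIDES", "PLATANOIDES", "PLATANOIDES",
            "RUBRUM", "RUBRUM", "NIGRA",
            "NIGRA", "NIGRA", "RUBRUM",
        ],
    }
)

top_n = 2
# For every neighbourhood, list its top_n most frequent species.
top_per_neighbourhood = toy.groupby("neighbourhood_name")["species_name"].apply(
    lambda s: s.value_counts().nlargest(top_n).index.tolist()
)
print(top_per_neighbourhood)
```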
We add the species to our combined interactive chart so that we can compare which of the top 5 species has the largest diameter and height:
radio_top_species = alt.binding_radio(options=top_species, name="Top 5 Species")
select_top_species_and_neighbourhood = alt.selection_single(
fields=["neighbourhood_name", "species_name"],
bind={
"species_name": radio_top_species,
"neighbourhood_name": dropdown_neighbourhood,
},name='Species'
)
We set a fixed color for the points and add the combined species and neighbourhood selection to city_radio_plot:
city_radio_plot = (
city_plot.encode(
color=alt.value('purple')
)
.transform_filter(select_top_species_and_neighbourhood)
.add_selection(select_top_species_and_neighbourhood)
).properties(title='Diameter and Height of a Tree for each species and neighbourhood')
city_radio_plot
There is a positive relationship between the diameter of a tree and its height range.
Answer: The Platanoides species has the highest height_range_id and diameter in Kerrisdale, at 5 and 36.5 inches respectively. The interactive chart also supports further insights, such as the smallest diameter for each species and neighbourhood.
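An answer read off the chart can be cross-checked with a small aggregation. This is a sketch on made-up numbers, so its output does not reproduce the figures quoted above; `toy`, `species_max`, `widest` and `tallest` are hypothetical names.

```python
import pandas as pd

# Made-up rows for illustration; the values are not taken from the dataset.
toy = pd.DataFrame(
    {
        "neighbourhood_name": ["Kerrisdale", "Kerrisdale", "Kerrisdale", "Sunset"],
        "species_name": ["PLATANOIDES", "RUBRUM", "SERRULATA", "NIGRA"],
        "diameter": [30.0, 12.0, 18.5, 40.0],
        "height_range_id": [5, 3, 2, 6],
    }
)

# Restrict to one neighbourhood, then take the per-species maxima.
kerrisdale = toy[toy["neighbourhood_name"] == "Kerrisdale"]
species_max = kerrisdale.groupby("species_name")[["diameter", "height_range_id"]].max()

widest = species_max["diameter"].idxmax()
tallest = species_max["height_range_id"].idxmax()
print(widest, tallest)
```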
click = alt.selection_single(fields=['neighbourhood_name'])
scatter_points = (
alt.Chart(new_trees_df)
.mark_point()
.encode(
x=alt.X('diameter:Q', title="Diameter", scale=alt.Scale(zero=False)),
y=alt.Y('height_range_id:Q', title="Height Range ID"),
color=alt.condition(click,'diameter:Q', alt.value('white')),
opacity=alt.condition(click,alt.value(0.9), alt.value(0)),
tooltip=['neighbourhood_name','diameter:Q','height_range_id:Q','species_name:N']
).add_selection(click)
.properties(height=300)
).interactive()
scatter_points
To add the Vancouver map to the plot, we first point Altair at the GeoJSON boundary data from this URL and store the reference in a data object:
url_json = "https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/local-area-boundary.geojson"
data_map = alt.Data(
url=url_json, format=alt.DataFormat(property="features", type="json")
)
data_map
Data({
format: DataFormat({
property: 'features',
type: 'json'
}),
url: 'https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/local-area-boundary.geojson'
})
Making a base map first is useful, so that we can later layer information from new_trees_df on top of it.
vancouver_map = (
alt.Chart(data_map)
.mark_geoshape(color="black", opacity=0.5, stroke="gray")
.encode()
.project(type="identity", reflectY=True)
)
vancouver_map
Next we make a map using the neighbourhood data in new_trees_df. We use the neighbourhood name as the key and match it between the GeoJSON features and the dataframe, which gives us a map of Vancouver colored by diameter.
vancouver_diameter_map = (alt.Chart(data_map).mark_geoshape()
.transform_lookup(
lookup='properties.name',
from_=alt.LookupData(new_trees_df, 'neighbourhood_name', ['diameter', 'neighbourhood_name'])
)
.encode(
color=alt.Color('diameter:Q', title='Average Diameter'),
opacity=alt.condition(click, alt.value(1), alt.value(0.2)),
tooltip=['neighbourhood_name:N', 'diameter:Q']
).add_selection(click)
.project(type="identity", reflectY=True)
.properties(title='Average Diameter of Trees in Each Neighbourhood')
)
# Display the chart
vancouver_diameter_map
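The map title promises an average diameter. As far as I understand, a lookup attaches fields from a single matching row per key rather than an aggregate, so one option is to average the diameters per neighbourhood in pandas first and pass that smaller table to the lookup. The sketch below shows only the aggregation step on made-up data; `mean_diameter_df` and `mean_diameter` are hypothetical names.

```python
import pandas as pd

# Made-up rows standing in for new_trees_df.
toy = pd.DataFrame(
    {
        "neighbourhood_name": ["Kerrisdale", "Kerrisdale", "Sunset", "Sunset"],
        "diameter": [10.0, 20.0, 4.0, 8.0],
    }
)

# One row per neighbourhood, holding the mean diameter of its trees.
mean_diameter_df = (
    toy.groupby("neighbourhood_name", as_index=False)["diameter"]
    .mean()
    .rename(columns={"diameter": "mean_diameter"})
)
print(mean_diameter_df)
```

A table with one row per neighbourhood is also much smaller to embed in the chart than the full dataframe.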
I want to center the title of the combined plot:
vancouver_diameter_map.transform_filter(click)
(vancouver_diameter_map | scatter_points).properties(title = 'Diameter and Height Range of trees in Vancouver').configure_title(anchor='middle')
Answer: This chart is visually appealing and conveys diameter and height range interactively. However, a widget-based chart, or even a color legend, is preferable for retrieving this information.
# Aggregate number of trees per neighborhood
neighborhood_tree_counts = trees_df['neighbourhood_name'].value_counts().reset_index()
neighborhood_tree_counts.columns = ['neighbourhood_name', 'tree_count']
new_trees_df = new_trees_df.merge(neighborhood_tree_counts, on='neighbourhood_name', how='left')
We make a slider selection for the planting year and then a bar chart of the number of trees per neighbourhood:
slider = alt.binding_range(min=1990, max=2024, step=1)
slider_selection = alt.selection_single(fields=['year'], bind=slider, name='Year')
bars = alt.Chart(new_trees_df).mark_bar().encode(
y=alt.Y('neighbourhood_name:N', title='Neighborhood'),
x=alt.X('tree_count:Q', title='Number of Trees Planted'),
tooltip=['neighbourhood_name:N', 'tree_count:Q'],
color=alt.condition(slider_selection, alt.value('skyblue'), alt.value('orange'))
).properties(
width=600, height=400, title='Number of Trees Planted per Neighbourhood'
).add_selection(
slider_selection
)
bars
The plant_area column holds both categorical and numeric values. We can use this plot to see the details available per species and per neighbourhood.
brush = alt.selection_interval(encodings=['x', 'y'])
base_map = alt.Chart(new_trees_df).mark_circle(size=20).encode(
alt.Y('plant_area:N'),
alt.X('count():Q'),
color=alt.condition(brush,'neighbourhood_name:N',alt.value('lightgray'),legend=None),
tooltip=['plant_area:N', 'diameter:Q','species_name:N','neighbourhood_name:N']).add_selection(brush).properties(
width=600,
)
base_map
base_map = base_map.encode(
color=alt.condition(click,'neighbourhood_name', alt.value('white'))
).transform_filter(slider_selection).properties(title='The number of trees in each Area')
Answer: This graph is user friendly thanks to its interactivity; it is easy to navigate and to check the number of trees planted and their planting area type.
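The same question, the number of trees in each neighbourhood by planting area, can also be answered as a plain table. This is a minimal sketch using `pd.crosstab` on made-up rows; `toy` and `area_counts` are hypothetical names.

```python
import pandas as pd

# Made-up rows standing in for new_trees_df.
toy = pd.DataFrame(
    {
        "neighbourhood_name": ["Kerrisdale", "Kerrisdale", "Kerrisdale", "Sunset", "Sunset"],
        "plant_area": ["7", "7", "N", "B", "7"],
    }
)

# Rows are neighbourhoods, columns are planting areas, cells are tree counts.
area_counts = pd.crosstab(toy["neighbourhood_name"], toy["plant_area"])
print(area_counts)
```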
We build a small dataframe holding the count of trees for each street side, to use instead of the whole dataframe:
side_counts = trees_df["street_side_name"].value_counts().reset_index()
side_counts
| | index | street_side_name |
|---|---|---|
| 0 | ODD | 2554 |
| 1 | EVEN | 2348 |
| 2 | MED | 94 |
| 3 | BIKE MED | 4 |
side_counts.columns = ["street_side_name", "side_count"]
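The column names produced by `value_counts().reset_index()` differ between pandas versions, which is why the columns are renamed by position above. A sketch of a variant that names both columns explicitly is shown below on made-up data; `sides` and `side_counts_toy` are hypothetical names.

```python
import pandas as pd

# Made-up street sides standing in for trees_df["street_side_name"].
sides = pd.Series(["ODD", "EVEN", "ODD", "MED", "ODD", "EVEN"], name="street_side_name")

# Name the index and the count column explicitly instead of relying on defaults.
side_counts_toy = (
    sides.value_counts()
    .rename_axis("street_side_name")
    .reset_index(name="side_count")
)
print(side_counts_toy)
```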
For comparing the number of trees on each side of the street, a circle plot works well:
# Create Altair scatter plot with circles
circle_plot = (
alt.Chart(side_counts)
.mark_circle(size=100)
.encode(
x=alt.X("side_count:Q", title="Number of Trees"),
y=alt.Y("street_side_name:N", title="Street Side"),
color=alt.Color("street_side_name:N", title="Street Side"),
tooltip=["street_side_name:N", "side_count:Q"]
)
.properties(title="Number of Trees on Even and Odd Sides of the Street",width=500,height=200)
).configure_axis(grid=True)
circle_plot
Here it is clear that the odd side of the street has the most trees, with 2554.
dashboard = (
((vancouver_diameter_map | base_map) & (city_radio_plot | bars))
.properties(
title={
"text": "Interactive Dashboard for Vancouver Trees Dataset",
"subtitle": "Exploring tree distribution and characteristics in Vancouver",
"subtitleColor": "gray",
"align": "center",
}
)
.configure_view(height=400, width=500)
.resolve_scale(color="independent")
.add_selection(click)
)
dashboard
In this EDA I extracted useful information about the street trees in Vancouver. I think the number of trees planted per year would have been a good way to see how planting has changed over time and what is needed to keep it consistent. Beyond that, I identified the five most common species in the city and found that their height and diameter are positively correlated, which suggests that trees with larger diameters tend to be taller. This still varies across neighbourhoods and species. Furthermore, the interactive bar chart and scatter plot show the planting area of the trees in each neighbourhood and the total number of trees planted. Building the dashboard was the fun part: combining all four charts, making them interact with each other, placing two widgets in one chart, and using a click selection that works on both the map and the bar chart.